
    No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration

    Recognizing who is speaking in a crowded scene is a key challenge toward understanding the social interactions taking place within it. Detecting speaking status from body movement alone opens the door to the analysis of social scenes in which personal audio is not obtainable. Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way. In the video modality, action recognition methods traditionally use a bounding box to localize and segment out the target subject, and then recognize the action taking place within it. However, cross-contamination, occlusion, and the articulated nature of the human body make this approach challenging in a crowded scene. Here, we leverage articulated body poses both for subject localization and in the subsequent speech detection stage. We show that selecting local features around pose keypoints has a positive effect on generalization performance while also significantly reducing the number of local features considered, making for a more efficient method. Using two in-the-wild datasets with different viewpoints of subjects, we investigate the role of cross-contamination in this effect. We additionally make use of acceleration measured through wearable sensors for the same task, and present a multimodal approach combining both methods.
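
    As a rough illustration of the local-feature idea above (a sketch under assumed array shapes, not the authors' pipeline), the snippet below samples a simple frame-difference motion statistic in small patches around pose keypoints instead of over a full bounding box; the patch size and the statistic are stand-ins for the actual local descriptors.

```python
# Minimal sketch: keep only local motion features around pose keypoints
# instead of a whole bounding box. Shapes and the patch-based motion
# statistic are illustrative assumptions, not the paper's features.
import numpy as np

def keypoint_local_features(frames, keypoints, patch=8):
    """frames: (T, H, W) grayscale video; keypoints: (T, K, 2) pixel coords (x, y)."""
    T, H, W = frames.shape
    feats = []
    for t in range(1, T):
        diff = np.abs(frames[t].astype(float) - frames[t - 1].astype(float))
        per_kp = []
        for x, y in keypoints[t]:
            x, y = int(round(x)), int(round(y))
            x0, x1 = max(0, x - patch), min(W, x + patch)
            y0, y1 = max(0, y - patch), min(H, y + patch)
            window = diff[y0:y1, x0:x1]
            # mean frame difference in the patch as a crude local motion cue
            per_kp.append(window.mean() if window.size else 0.0)
        feats.append(per_kp)
    return np.asarray(feats)  # (T-1, K) local motion features

# Example with random data standing in for video frames and pose tracks
rng = np.random.default_rng(0)
frames = rng.integers(0, 255, size=(30, 120, 80)).astype(np.uint8)
keypoints = rng.uniform([0, 0], [80, 120], size=(30, 17, 2))
print(keypoint_local_features(frames, keypoints).shape)  # (29, 17)
```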

    Who is where? Matching people in video to wearable acceleration during crowded mingling events

    We address the challenging problem of associating acceleration data from a wearable sensor with the corresponding spatio-temporal region of a person in video during crowded mingling scenarios. This is an important first step for multi-sensor behavior analysis using these two modalities. Clearly, as the number of people in a scene increases, there is also a need to robustly and automatically associate a region of the video with each person's device. We propose a hierarchical association approach which exploits the spatial context of the scene, outperforming the state-of-the-art approaches significantly. Moreover, we present experiments on matching from 3 to more than 130 acceleration and video streams which, to our knowledge, is significantly larger than prior works where only up to 5 device streams are associated.
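
    For context, the sketch below shows a flat correlation-plus-assignment baseline for this association problem, not the hierarchical approach of the paper: each device's acceleration magnitude is correlated with each video track's motion energy, and the one-to-one matching is solved with the Hungarian algorithm. All names and signal shapes are illustrative.

```python
# A flat baseline sketch for stream-to-person association (not the paper's
# hierarchical method): correlate wearable acceleration magnitude with
# per-track video motion energy, then solve the one-to-one assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(accel_mag, video_motion):
    """accel_mag: (D, T) device signals; video_motion: (P, T) video track signals."""
    A = (accel_mag - accel_mag.mean(1, keepdims=True)) / (accel_mag.std(1, keepdims=True) + 1e-8)
    V = (video_motion - video_motion.mean(1, keepdims=True)) / (video_motion.std(1, keepdims=True) + 1e-8)
    corr = A @ V.T / A.shape[1]                # (D, P) correlation at lag 0
    rows, cols = linear_sum_assignment(-corr)  # maximize total correlation
    return list(zip(rows, cols))               # (device, video track) pairs

rng = np.random.default_rng(1)
true_motion = rng.standard_normal((5, 600))
accel = true_motion + 0.3 * rng.standard_normal((5, 600))
perm = rng.permutation(5)
# each device should be matched back to its (permuted) video track
print(associate(accel, true_motion[perm]))
```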

    Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild

    Laughter is considered one of the most overt signals of joy. Laughter is well recognized as a multimodal phenomenon, but it is most commonly detected by sensing the sound of laughter. It is unclear how the perception and annotation of laughter differ when it is annotated from other modalities, such as video of the body movements of laughter. In this paper we take a first step in this direction by asking if and how well laughter can be annotated when only audio, only video (containing full body movement information), or audiovisual modalities are available to annotators. We ask whether annotations of laughter are congruent across modalities, and compare the effect that labeling modality has on machine learning model performance. We compare annotations and models for laughter detection, intensity estimation, and segmentation, three tasks common in previous studies of laughter. Our analysis of more than 4000 annotations acquired from 48 annotators revealed evidence of incongruity in the perception of laughter and its intensity between modalities. Further analysis of annotations against consolidated audiovisual reference annotations revealed that recall was lower on average for the video condition than for the audio condition, but tended to increase with the intensity of the laughter samples. Our machine learning experiments compared the performance of state-of-the-art unimodal (audio-based, video-based, and acceleration-based) and multimodal models for different combinations of input modalities, training label modality, and testing label modality. Models with video and acceleration inputs had similar performance regardless of training label modality, suggesting that it may be entirely appropriate to train models for laughter detection from body movements using video-acquired labels, despite their lower inter-rater agreement.
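
    As a toy illustration of the recall-by-intensity analysis mentioned above, the following sketch compares dummy audio-only and video-only annotations against a consolidated reference; the intensity scale, detection probabilities, and data are all invented.

```python
# Illustrative sketch of the kind of per-modality comparison described above:
# recall of audio-only vs video-only annotations against a consolidated
# audiovisual reference, broken down by laughter intensity. Data are dummy.
import numpy as np

def recall(pred, ref):
    pred, ref = np.asarray(pred, bool), np.asarray(ref, bool)
    return (pred & ref).sum() / max(ref.sum(), 1)

rng = np.random.default_rng(2)
reference = rng.random(1000) < 0.2                     # consolidated AV laughter frames
intensity = rng.integers(1, 4, size=1000)              # 1 = subtle ... 3 = intense (hypothetical)
audio_ann = reference & (rng.random(1000) < 0.9)       # audio annotators catch most laughs
video_ann = reference & (rng.random(1000) < (0.4 + 0.2 * intensity))  # video recall grows with intensity

for level in (1, 2, 3):
    m = intensity == level
    print(level,
          round(recall(audio_ann[m], reference[m]), 2),
          round(recall(video_ann[m], reference[m]), 2))
```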

    Estimating self-assessed personality from body movements and proximity in crowded mingling scenarios

    This paper focuses on the automatic classification of self-assessed personality traits from the HEXACO inventory during crowded mingle scenarios. We exploit acceleration and proximity data from a wearable device hung around the neck. Unlike most state-of-the-art studies, addressing personality estimation during mingle scenarios provides a challenging social context as people interact dynamically and freely in a face-to-face setting. While many former studies use audio to extract speech-related features, we present a novel method of extracting an individual's speaking status from a single body-worn triaxial accelerometer, which scales easily to large populations. Moreover, by fusing both speech- and movement-energy-related cues from acceleration alone, our experimental results show improvements in the estimation of Humility over features extracted from a single behavioral modality. We validated our method on 71 participants, obtaining an accuracy of 69% for Honesty, Conscientiousness, and Openness to Experience. To our knowledge, this is the largest validation of personality estimation carried out in such a social context with simple wearable sensors.
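
    The snippet below is a minimal sketch, under assumed sampling-rate and window settings, of the kind of windowed movement-energy cues one could extract from a single neck-worn triaxial accelerometer to feed a speaking-status or trait classifier; it is not the authors' exact feature set.

```python
# Minimal sketch (assumed feature set, not the authors' exact pipeline):
# windowed movement-energy cues from a single neck-worn triaxial accelerometer.
import numpy as np

def window_features(acc, fs=20, win_s=3.0):
    """acc: (T, 3) raw acceleration sampled at fs Hz."""
    mag = np.linalg.norm(acc, axis=1)
    mag = mag - mag.mean()                        # crude gravity/offset removal
    step = int(win_s * fs)
    feats = []
    for start in range(0, len(mag) - step + 1, step):
        w = mag[start:start + step]
        psd = np.abs(np.fft.rfft(w)) ** 2
        freqs = np.fft.rfftfreq(len(w), d=1.0 / fs)
        band = psd[(freqs >= 1.0) & (freqs <= 5.0)].sum()  # hypothesized speech-related band
        feats.append([w.var(), np.abs(np.diff(w)).mean(), band])
    return np.asarray(feats)                      # (n_windows, 3)

acc = np.random.default_rng(3).standard_normal((20 * 60, 3))  # one minute at 20 Hz
print(window_features(acc).shape)                              # (20, 3)
```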

    A Hierarchical Approach for Associating Body-Worn Sensors to Video Regions in Crowded Mingling Scenarios


    Towards Analyzing and Predicting the Experience of Live Performances with Wearable Sensing

    We present an approach to interpret the response of audiences to live performances by processing mobile sensor data. We apply our method to three datasets obtained from three live performances, where each audience member wore a single tri-axial accelerometer and proximity sensor embedded inside a smart sensor pack. Using these sensor data, we developed a novel approach to predict audience members' self-reported experience of the performances in terms of enjoyment, immersion, willingness to recommend the event to others, and change in mood. The proposed approach uses an unsupervised method to identify informative intervals of the event, based on the linkage of the audience members' bodily movements, and uses data from these intervals only to estimate the audience members' experience. We also analyze how the relative location of members of the audience can affect their experience, and present an automatic way of recovering neighborhood information based on proximity sensors. We further show that the linkage of the audience members' bodily movements is informative of memorable moments which were later reported by the audience.
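
    One possible way to operationalise the "linkage" of bodily movements described above (an assumption, not the paper's exact measure) is the mean pairwise correlation of accelerometer magnitudes per window, keeping the most synchronised windows as candidate informative intervals:

```python
# Sketch of a simple linkage measure: mean pairwise correlation of audience
# accelerometer magnitudes per window; windows with high linkage are kept.
import numpy as np

def high_linkage_windows(mags, win=200, top_k=3):
    """mags: (N_members, T) acceleration magnitudes."""
    scores = []
    for start in range(0, mags.shape[1] - win + 1, win):
        seg = mags[:, start:start + win]
        c = np.corrcoef(seg)
        linkage = c[np.triu_indices_from(c, k=1)].mean()  # mean off-diagonal correlation
        scores.append((linkage, start))
    return sorted(scores, reverse=True)[:top_k]           # most "linked" intervals

rng = np.random.default_rng(4)
mags = rng.standard_normal((30, 2000))
mags[:, 800:1000] += np.sin(np.linspace(0, 20, 200))      # a shared, synchronised burst
print(high_linkage_windows(mags))                         # the burst window scores highest
```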

    The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates

    We present MatchNMingle, a novel multimodal/multisensor dataset for the analysis of free-standing conversational groups and speed dates in-the-wild. MatchNMingle leverages wearable devices and overhead cameras to record the social interactions of 92 people during real-life speed dates, followed by a cocktail party. To our knowledge, MatchNMingle has the largest number of participants, longest recording time, and largest set of manual annotations for social actions available in this context in a real-life scenario. It consists of 2 hours of data from wearable acceleration, binary proximity, video, audio, personality surveys, frontal pictures, and speed-date responses. Participants' positions and group formations were manually annotated, as were social actions (e.g. speaking, hand gestures) for 30 minutes at 20 fps, making it the first dataset to incorporate the annotation of such cues in this context. We present an empirical analysis of the performance of crowdsourcing workers against trained annotators in simple and complex annotation tasks, finding that although crowdsourcing workers are efficient for simple tasks, using them for more complex tasks such as social action annotation led to additional overhead and poor inter-annotator agreement compared to trained annotators (differences up to 0.4 in Fleiss' Kappa coefficients). We also provide example experiments of how MatchNMingle can be used.
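
    For reference, the inter-annotator agreement statistic cited above (Fleiss' kappa) can be computed as in the short sketch below; the rating counts used here are dummy values.

```python
# Hedged sketch of Fleiss' kappa, the agreement statistic cited above for
# comparing crowdsourced vs. trained annotators. Counts are dummy data.
import numpy as np

def fleiss_kappa(counts):
    """counts: (N_items, K_categories); each row sums to the number of raters n."""
    counts = np.asarray(counts, float)
    n = counts.sum(axis=1)[0]                          # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / counts.sum()            # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()      # observed vs. chance agreement
    return (P_bar - P_e) / (1 - P_e)

# 5 items, 4 raters, 2 categories (e.g. speaking / not speaking)
print(round(fleiss_kappa([[4, 0], [3, 1], [2, 2], [0, 4], [4, 0]]), 3))
```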

    Listen to the real experts: Detecting need of caregiver response in a NICU using multimodal monitoring signals

    Vital signs are used in Neonatal Intensive Care Units (NICUs) to monitor the state of multiple patients at once. Alarms are triggered if a vital sign falls below or rises above a predefined threshold. Numerous alarms sound each hour, which can translate into an overload for the medical team, known as alarm fatigue. Yet many of these alarms do not require immediate clinical action from the caregivers. In this paper we automatically detect moments that need an immediate response (i.e. interaction with the patient) from the medical team in NICUs by using caregiver response to the patient, which is based on the interpretation of vital signs and of nonverbal cues (e.g. movements) delivered by patients. The ultimate goal of such an approach is to reduce the overload of alarms while maintaining patient safety. We use features extracted from the electrocardiogram (ECG) and pulse oximetry (SpO2) sensors of the patient, as most unplanned interactions between patient and caregivers are due to deteriorations. Since in our unit an alarm can only be paused or silenced manually at the bedside, we used this information as a prior for caregiver response. We also propose different labeling schemes for classification, each representative of a possible interaction scenario within the nature of our problem. We accomplished a general detection of caregiver response with a mean AUC of 0.82. We also show that when trained only with stable and truly deteriorating (critical state) samples, the classifiers can better learn the difference between alarms that need no immediate response and those that do. In addition, we present an analysis of the posterior probabilities over time for different labeling schemes, and use it to speculate about the reasons behind some failure cases.
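
    As a schematic of the kind of evaluation described above, the sketch below trains a classifier on synthetic stand-ins for ECG/SpO2-derived features and scores it with cross-validated AUC for a single, hypothetical labeling scheme; it is not the authors' model or data.

```python
# Illustrative sketch (dummy features and labels): score a "needs caregiver
# response" classifier with cross-validated AUC, as in the evaluation above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.standard_normal((400, 6))                     # stand-ins for ECG/SpO2 window features
y = (X[:, 0] + 0.5 * rng.standard_normal(400) > 0).astype(int)  # hypothetical response label

auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc")
print(auc.mean())                                      # mean AUC across folds
```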

    Estimation of Heart Rate Directly from ECG Spectrogram in Neonate Intensive Care Units

    This paper presents a simple yet novel method to estimate the heart frequency (HF) of neonates directly from the ECG signal, instead of using the RR-interval signals as is generally done in clinical practice. From this, the heart rate (HR) can be derived. We thus avoid the use of peak detectors and the inherent errors that come with them. Our method leverages the highest Power Spectral Densities (PSD) of the ECG, for the bins around the frequencies related to neonatal heart rates, as they change over time (spectrograms). We tested our approach on 6 days of monitoring data from 52 patients in a Neonatal Intensive Care Unit (NICU) and compared against the HR from a commercial monitor, which produced a sample every second. The comparison showed that 92.4% of the samples have a difference lower than 5 bpm. Moreover, we obtained a median MAE (Mean Absolute Error) between subjects of 2.28 bpm and a median RMSE (Root Mean Square Error) of 5.82 bpm. Although tested on neonates, we hypothesize that this method can also be customized for other populations. Finally, we analyzed the failure cases of our method and found that errors caused by moments of higher PSD in the lower frequencies co-occurred with critical alarms related to other physiological systems (e.g. desaturation).
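
    A minimal sketch of the core idea, assuming a synthetic ECG, an arbitrary neonatal heart-rate band, and SciPy's spectrogram: pick the frequency with the highest power spectral density in that band for each time bin and convert it to beats per minute, with no R-peak detection involved.

```python
# Minimal sketch of HR estimation from the ECG spectrogram: peak spectral
# density in an assumed neonatal heart-rate band, no R-peak detection.
import numpy as np
from scipy.signal import spectrogram

def hr_from_ecg(ecg, fs=250, lo_bpm=90, hi_bpm=230):
    f, t, Sxx = spectrogram(ecg, fs=fs, nperseg=4 * fs, noverlap=3 * fs)
    band = (f >= lo_bpm / 60) & (f <= hi_bpm / 60)      # restrict to the neonatal HR band
    peak = f[band][np.argmax(Sxx[band, :], axis=0)]     # dominant frequency per time bin
    return t, 60 * peak                                 # heart rate in bpm over time

fs, hr_true = 250, 150                                  # synthetic "ECG" beating at 150 bpm
ts = np.arange(0, 60, 1 / fs)
ecg = np.sin(2 * np.pi * (hr_true / 60) * ts) + 0.1 * np.random.randn(len(ts))
t, hr = hr_from_ecg(ecg, fs)
print(hr.round(1))                                      # ~150 bpm in each window
```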

    Prediction of Late-Onset Sepsis in Preterm Infants Using Monitoring Signals and Machine Learning

    Objectives: Prediction of late-onset sepsis (onset beyond day 3 of life) in preterm infants, based on multiple patient monitoring signals in the 24 hours before onset. Design: Continuous high-resolution electrocardiogram and respiration (chest impedance) data from the monitoring signals were extracted and used to create time-interval features representing heart rate variability, respiration, and body motion. For each infant with a blood culture-proven late-onset sepsis, a Cultures, Resuscitation, and Antibiotics Started Here moment was defined. The Cultures, Resuscitation, and Antibiotics Started Here moment served as an anchor point for the prediction analysis. In the control group (C), an "equivalent crash moment" was calculated as the anchor point, based on comparable gestational and postnatal age. Three common machine learning approaches (logistic regressor, naive Bayes, and nearest mean classifier) were used to perform binary classification of late-onset sepsis versus C samples. For training and evaluation of the three classifiers, a leave-k-subjects-out cross-validation was used. Setting: Level III neonatal ICU. Patients: The patient population consisted of 32 premature infants with sepsis and 32 age-matched control patients. Interventions: No interventions were performed. Measurements and Main Results: For the interval features representing heart rate variability, respiration, and body motion, differences between late-onset sepsis and C were visible up to 5 hours preceding the Cultures, Resuscitation, and Antibiotics Started Here moment. Using a combination of all features, classification of late-onset sepsis versus C showed a mean accuracy of 0.79 ± 0.12 and a mean precision of 0.82 ± 0.18 at 3 hours before the onset of sepsis. Conclusions: Information from routine patient monitoring can be used to predict sepsis. Specifically, this study shows that a combination of electrocardiogram-based, respiration-based, and motion-based features enables the prediction of late-onset sepsis hours before the clinical crash moment.
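
    The sketch below mirrors the evaluation setup described above on synthetic data: the three classifier families under subject-grouped cross-validation, with scikit-learn's GroupKFold standing in for leave-k-subjects-out.

```python
# Hedged sketch of the evaluation setup: logistic regression, naive Bayes and
# a nearest-mean classifier under subject-grouped CV. Features are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(6)
n_subjects, samples_per = 64, 24
groups = np.repeat(np.arange(n_subjects), samples_per)      # one group per infant
y = np.repeat(rng.integers(0, 2, n_subjects), samples_per)  # sepsis vs. control per subject
X = rng.standard_normal((len(y), 8)) + y[:, None]           # HRV/respiration/motion stand-ins

cv = GroupKFold(n_splits=8)                                 # each fold leaves whole subjects out
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("naive Bayes", GaussianNB()),
                  ("nearest mean", NearestCentroid())]:
    acc = cross_val_score(clf, X, y, cv=cv, groups=groups)
    print(name, acc.mean().round(2))
```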